(57) A phoneticizer converts spelled words or names into one
or an n-best number of phonetic transcriptions. The n-best
transcriptions may be generated from a single transcription
using a confusion matrix. These n-best transcriptions are then
transformed into hybrid units. Preferably only the most
frequently encountered units are stored as syllables, with the
remainder being stored as smaller units such as demi-syllables
or phonemes. Voice input is then used to rescore the n-best
transcriptions and these are stored preferably as
speaker-independent, similarity-based hybrid units
concatenated into a string representing the spelled word.



Small footprint language and vocabulary independent word recognizer using registration 
by word spelling 



[FIG. 4: tree diagram of a letter-only decision tree. Internal
nodes hold yes/no questions about neighboring letters, including
tests against the letter classes VOW and CONS; leaf nodes hold
phoneme probabilities, for example "iy=>0.51", "-=>0.30",
"eh=>0.08", "ih=>0.05", "ey=>0.03".]
Printed by Jouve, 75001 PARIS (FR)






EP 0 984 430 A2 






Description 

Background and Summary of the Invention 

[0001] The present invention relates generally to 
speech recognizers. More particularly, the invention relates
to a small memory footprint recognizer suitable for
embedded applications where available memory and 
processor resources are limited. New words are added 
to the recognizer lexicon by entry as spelled words that 
are then converted into phonetic transcriptions and sub- 
sequently into syllabic transcriptions for storage in the 
lexicon. 

[0002] The trend in consumer products today is to in- 
corporate speech technology to make these products 
easier to use. Many consumer products, such as cellular 
telephones, offer ideal opportunities to exploit speech 
technology; however, they also present a challenge in
that memory and processing power are often limited. Considering
the particular case of using speech recognition
technology for voice dialing of cellular telephones, the 
embedded recognizer will need to fit into a relatively 
small amount of non-volatile memory, and the random 
access memory used by the recognizer in operation is 
also fairly limited. 

[0003] To economize memory usage, the typical em- 
bedded recognizer system will have a very limited, often 
static, vocabulary. The more flexible large vocabulary 
recognizers that employ a phonetic approach combined 
with statistical techniques, such as Hidden Markov Mod- 
el (HMM), use far too much memory for many embed- 
ded system applications. Moreover, the more powerful, 
general purpose recognizers model words on subword 
units, such as phonemes, that are concatenated to define
the word models. Frequently these models are
context-dependent. They store different versions of 
each phoneme according to what neighboring pho- 
nemes precede and follow (typically stored as tri- 
phones). For most embedded applications there are 
simply too many triphones to be stored in a small 
amount of memory. 

[0004] Related to the memory constraint issue, many 
embedded systems have difficulty accommodating a us- 
er who wishes to add new words to the lexicon of rec- 
ognized words. Not only is lexicon storage space limit- 
ed, but the temporary storage space needed to perform 
the word addition process is also limited. Moreover, in 
embedded systems, such as the cellular telephone, 
where the processor needs to handle other tasks, con- 
ventional lexicon updating procedures may not be pos- 
sible within a reasonable length of time. User interaction 
features common to conventional recognizer technolo- 
gy are also restricted. For example, in a conventional 
recognizer system, a guidance prompt is typically em- 
ployed to confirm that a word uttered by the user was 
correctly recognized. In conventional systems the guidance
prompt may be an encoded version of the user's
recorded speech. In some highly constrained embed- 



ded systems, such guidance prompts may not be prac- 
tical because the encoded version of the recorded 
speech (guidance voice) requires too much memory. 
[0005] The present invention addresses the above problems by
providing a small memory footprint recognizer that may be
trained quickly and without large memory consumption by entry
of new words through spelling. The user enters characters, such
as through a keyboard or a touch-tone pad of a telephone, and
these characters are processed by a phoneticizer that uses
decision trees or the like to generate a phonetic transcription
of the spelled word. If desired, multiple transcriptions can be
generated by the phoneticizer, yielding the n-best
transcriptions. Where memory is highly constrained, the n-best
transcriptions can be generated using a confusion matrix that
calculates the n-best transcriptions based on the one
transcription produced by the phoneticizer. These
transcriptions are then converted into another form based on
hybrid sound units described next.

[0006] The system employs a hybrid sound unit for representing
words in the lexicon. The transcriptions produced by the
phoneticizer are converted into these hybrid sound units for
compact storage in the lexicon. The hybrid units can comprise a
mixture of several different sound units, including syllables,
demi-syllables, phonemes and the like. Preferably the hybrid
units are selected so that the class of larger sound units
(e.g., syllables) represents the most frequently used sounds in
the lexicon, and so that one or more classes of smaller sound
units (e.g., demi-syllables and phonemes) represent the less
frequently used sounds. Such a mixture gives the high
recognition quality associated with larger sound units without
the large memory requirement. Co-articulated sounds are handled
better by the larger sound units, for example.

[0007] Using a dictionary of hybrid sound units, the
transcriptions produced by phonetic transcription are converted
to yield the n-best hybrid unit transcriptions. If desired, the
transcriptions can be rescored at this stage, using decision
trees or the like. Alternatively, the best transcription (or
set of n-best transcriptions) is extracted through user
interaction or by comparison to the voice input supplied by the
user (e.g., through the microphone of a cellular telephone).

[0008] A word template is then constructed from the extracted
best or n-best transcriptions by selecting previously stored
hybrid units from the hybrid unit dictionary; these units are
concatenated to form a hybrid unit string representing the
word. Preferably the hybrid units are represented using a
suitable speaker-independent representation; a phone similarity
representation is presently preferred, although other
representations can be used. The spelled word (letters) and the
hybrid unit string (concatenated hybrid units) are stored in the
lexicon as a new entry. If desired, the stored spelled word can
be used as a guidance prompt by displaying it on the LCD display
of the consumer product.
[0009] The recognizer of the invention is highly memory
efficient. In contrast with the large lexicon of HMM parameters
found in conventional systems, the lexicon of the invention is
quite compact. Only a few bytes are needed to store the spelled
word letters and the associated hybrid unit string. Being based
on hybrid units, the word model representation is highly
compact, and the hybrid unit dictionary used in word template
construction is also significantly smaller than dictionaries
found in conventional systems.

[0010] For a more complete understanding of the invention, its
objects and advantages, refer to the following specification
and to the accompanying drawings.

Brief Description of the Drawings 

[0011] 

Figure 1 is a block diagram of one embodiment of 
the recognizer in accordance with the invention;

Figure 2 is a flow chart diagram illustrating a pres- 
ently preferred syllabification process; 
Figure 3 is a block diagram illustrating the presently 
preferred phoneticizer using decision trees; 
Figure 4 is a tree diagram illustrating a letter-only 
tree; and 

Figure 5 is a tree diagram illustrating a mixed tree 
in accordance with the invention. 

Detailed Description of the Preferred Embodiments 

[0012] Referring to Figure 1, the speech recognizer of
the invention will be described in the context of a typical 
consumer product application, in this case a cellular tel- 
ephone application. It will, of course, be appreciated that 
the principles of the invention can be applied in a variety 
of different applications and are therefore not limited to 
the cellular telephone application illustrated here. 
[0013] The recognizer system stores entries for all 
words that it can recognize in a lexicon. Unlike conven- 
tional recognizers, however, this system represents 
each word as a string of concatenated hybrid units. In 
the case of the cellular telephone application some of 
the words in the lexicon may represent the names of 
parties to whom telephone numbers have been as- 
signed by the user. Thus the user can speak the name 
of the party into the cellular telephone device 12 and the
system will then recognize the spoken name and look 
up the associated telephone number so that the call can 
be placed. 

[0014] In order to better understand how the recog- 
nizer of the invention represents entries in its lexicon, a 
description of the presently preferred word registration 
system will now be presented. The word registration 
system is the mechanism by which new words are add- 
ed to the lexicon through word spelling entry. 
[0015] To add a new word to the lexicon, the user 
spells the word, the spelled letters representing the new 



word input. Any suitable means can be used to input the 
letters of the spelled word. Hardware devices such as 
keyboards or touch-tone keypads may be used. Voice 
recognition can also be used, where the recognizer itself
converts the spoken letters into alphanumeric characters.

[0016] The spelled word entered by the user is proc- 
essed by phoneticizer 14. Phoneticizer 14 converts the 
spelled word letters into one or more phonetic transcrip- 
tions. The presently preferred embodiment uses deci- 
sion trees to perform the letter to phoneme conversion. 
The presently preferred phoneticizer uses one decision 
tree per letter of the alphabet; each decision tree yields 
the probability that a given letter will have a given pho- 
netic transcription, based on information about neigh- 
boring letters. A more complete description of the pres- 
ently preferred decision tree-based phoneticizer ap- 
pears later in this document. While decision tree tech- 
nology is presently preferred, other algorithmic or heu- 
ristic techniques may also be used. 
[0017] Phoneticizer 14 generates at least one phonetic
transcription, and optionally multiple phonetic
transcriptions, for the spelled word entry. The phoneticizer
attaches a probability value or score to each letter to 
phoneme conversion, and these data may be used to 
rank the phonetic transcriptions in the order of the n- 
best, where n is an integer value. In one embodiment, 
phoneticizer 14 generates the n-best transcriptions and 
outputs this as a list to hybrid unit transcription module 
20. In an alternate embodiment phoneticizer 14 gener- 
ates a single phonetic transcription (e.g., the best tran- 
scription) and this transcription is then processed by an 
n-best transcription generator 18 that uses a confusion 
matrix 19 to generate a list of n-best phonetic transcrip- 
tions based on the single transcription provided by the 
phoneticizer. The confusion matrix consists of a 
prestored look-up table of frequently confused phonetic 
sounds. The generator 18 uses the confusion matrix to 
create multiple permutations of the original phonetic 
transcription by substituting sounds obtained from the 
confusion matrix. 
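
The substitution scheme described above can be sketched in a few lines of Python. This is a minimal illustration only, not the disclosed implementation: the table `CONFUSION`, its weights, and the function name `n_best_transcriptions` are assumptions introduced here, and the exhaustive enumeration is practical only for short transcriptions such as names.

```python
import heapq
from itertools import product

# Hypothetical confusion matrix: phoneme -> [(confusable phoneme, weight)].
# The weights are invented for illustration.
CONFUSION = {
    "ih": [("iy", 0.2), ("eh", 0.1)],
    "s": [("z", 0.15)],
}

def n_best_transcriptions(transcription, confusion, n):
    """Return the n highest-scoring (score, phonemes) variants of a transcription."""
    options = []
    for phoneme in transcription:
        alternatives = confusion.get(phoneme, [])
        # Weight left over for keeping the original phoneme unchanged.
        keep_weight = 1.0 - sum(weight for _, weight in alternatives)
        options.append([(phoneme, keep_weight)] + alternatives)

    candidates = []
    for combo in product(*options):  # every permutation of substitutions
        score = 1.0
        for _, weight in combo:
            score *= weight
        candidates.append((score, [phoneme for phoneme, _ in combo]))
    return heapq.nlargest(n, candidates, key=lambda c: c[0])
```

Calling `n_best_transcriptions(["s", "ih", "t"], CONFUSION, 3)` ranks the unmodified transcription first, followed by the variants obtained through the most heavily weighted substitutions.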

[0018] The hybrid unit transcription module 20 proc- 
esses the n-best phonetic transcriptions, converting 
these into hybrid unit transcriptions. The presently pre- 
ferred embodiment performs the phonetic-to-hybrid unit 
translation by first using the syllabification procedure
illustrated in Figure 2. The syllabification procedure re- 
sults in a list of the n-best syllabic transcriptions. The 
system consults dictionary 30 to determine whether 
each syllable in the syllabic transcription is found in the 
dictionary. If so, a stored code representing that syllable 
is substituted for the syllable. If not found, the syllable 
is further decomposed into its constituent sub-unit parts 
(e.g., demi-syllable or phonemes) and codes are select- 
ed from dictionary 30 to represent these parts. Thus the 
word is ultimately represented as hybrid units (a mixture 
of syllables, demi-syllables, phonemes, or other suitable 
sound units). These hybrid units are each represented 
as codes looked up in dictionary 30. This storage technique
saves considerable space in the lexicon, while
providing smooth transcriptions with good handling of 
co-articulated sounds for robust speech recognition. 
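
The dictionary lookup with fallback to smaller units can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name `encode_syllable`, the greedy longest-match fallback, and the sample code table are introduced here and are not taken from the disclosure, which does not specify how a syllable is decomposed.

```python
def encode_syllable(syllable, unit_codes):
    """Encode one syllable as a list of hybrid-unit codes.

    unit_codes maps tuples of phonemes (syllables, demi-syllables or
    single phonemes) to integer codes. The whole syllable is tried
    first; otherwise it is split greedily into the longest stored parts.
    """
    key = tuple(syllable)
    if key in unit_codes:
        return [unit_codes[key]]
    codes, i = [], 0
    while i < len(key):
        for j in range(len(key), i, -1):  # longest stored sub-unit first
            if key[i:j] in unit_codes:
                codes.append(unit_codes[key[i:j]])
                i = j
                break
        else:
            raise KeyError(f"no stored unit covers {key[i]!r}")
    return codes

def encode_word(syllables, unit_codes):
    """Concatenate the codes of all syllables of a word."""
    codes = []
    for syllable in syllables:
        codes.extend(encode_syllable(syllable, unit_codes))
    return codes
```

With a hypothetical table in which `("l", "ih", "ng")` and `("b", "eh")` are stored as whole units but `("ax", "m")` is not, the last syllable is emitted as the two phoneme codes for `ax` and `m`.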
[0019] To further illustrate, a syllable may comprise 
one or more phonetic sounds. Thus the syllabic tran- 
scription is a more macroscopic representation than the 
phonetic transcription. If syllables alone were used to 
represent words, a comparatively large lexicon would 
result. It may take, for example, 1000 or more syllables 
to represent the majority of words in the English lan- 
guage. The small footprint embodiment of the invention 
avoids the large lexicon by representing words as hybrid 
units in which only the most frequently used syllables 
are retained; the less frequently used syllables are bro- 
ken into smaller units, such as demi-syllables or pho- 
nemes and these smaller units are used in place of 
those syllables. This provides a natural data compression
which contributes to the invention's ability to use
memory efficiently. 
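
One way to choose which syllables are retained is to count syllable frequencies over a training list and keep only the most common ones, adding every phoneme as a fallback unit. The sketch below assumes this counting approach; the function name `build_unit_inventory` and the parameter `max_syllables` are hypothetical and do not appear in the disclosure.

```python
from collections import Counter

def build_unit_inventory(syllabified_words, max_syllables):
    """Map the most frequent syllables, plus all phonemes, to integer codes.

    syllabified_words is a list of words, each a list of syllables,
    each syllable a list of phoneme strings.
    """
    counts = Counter(tuple(s) for word in syllabified_words for s in word)
    kept = [syllable for syllable, _ in counts.most_common(max_syllables)]
    # Every phoneme is stored as a single-phoneme unit so that any
    # syllable that was not kept can still be spelled out.
    phonemes = sorted({(p,) for word in syllabified_words for s in word for p in s})
    units = kept + [p for p in phonemes if p not in kept]
    return {unit: code for code, unit in enumerate(units)}
```

Lowering `max_syllables` shrinks the dictionary at the cost of representing more words with the smaller units.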

[0020] From the n-best hybrid unit transcriptions, the 
best transcription or n-best transcriptions are selected 
by module 22. One technique for extracting the best 
transcription at 22 is to use the user's voice input. The 
user simply speaks the name into the device 12 and 
module 22 matches the spoken input to the n-best tran- 
scriptions obtained via module 20 to select one or the 
n-best transcriptions. One advantage of this extraction 
technique is that the recognizer system inherently codes 
for that user's voice. In effect, this results in a highly economical
speaker adaptation in which entries placed in
the lexicon are tuned to the user's voice. 
[0021] As an alternative to extraction by voice input,
module 22 can be configured to work in conjunction with 
a rescoring mechanism 24 that assigns new probability 
scores to each transcription based on rules regarding 
phonetic information. Although not required, the rescor- 
ing mechanism can improve performance and repre- 
sents a desirable addition if memory and processor re- 
sources are available. The presently preferred rescoring 
mechanism uses decision trees 26, which may be mixed 
decision trees comprising questions based on letters 
and questions based on phonemes. The description of 
decision tree phoneticizers provided below explains one 
embodiment of such a mixed decision tree mechanism 
for rescoring. 

[0022] With the best transcription or n-best transcrip- 
tions having been selected, word template constructor 
28 then builds a highly compact representation of the 
word by using the dictionary 30. The dictionary repre- 
sents hybrid units as units that may be used by the pat- 
tern matching algorithm of the desired recognizer. Sim- 
ilarity-based units, such as units based on phone simi- 
larity are presently preferred because they can be ren- 
dered speaker-independent and because they are 
memory efficient. Hidden Markov Models can also be 
used to represent the hybrid units, although such repre- 
sentation involves greater complexity. 



[0023] Phone similarity representations of the hybrid units can
be constructed in advance, using a suitable phoneme dictionary
against which the hybrid units are compared to compute phone
similarity. To make the system speaker-independent, the
database may include many examples of each hybrid unit which
are each compared with the phoneme dictionary to compute the
similarity for each unit. The examples may be provided as
training data. The results are then warped together, using a
suitable dynamic time warping (DTW) algorithm, resulting in an
"average" phone similarity representation for each hybrid unit.
These average phone similarity parameters or representations
are then stored in dictionary 30. While phone similarity-based
representation is presently preferred for its robustness and
economy, other representations may be used, including
representations ranging from complex speaker-independent Hidden
Markov Models to simple, less speaker-independent Linear
Predictive Coding.

[0024] The word template constructor builds a concatenated
string of phone similarity units corresponding to the hybrid
units contained in the extracted transcription. This string is
then stored in association with the spelled word in the
lexicon, as illustrated diagrammatically by data structure 32.
Data structure 32 contains spelled word entries 34 in
association with strings 36. The data structure may also store
other information, such as associated telephone numbers of
parties represented by the spelled words (names).
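
A lexicon entry of this kind might be modeled as shown below. The class and field names (`LexiconEntry`, `unit_codes`, `phone_number`) are illustrative assumptions; the disclosure specifies only that the spelled word is stored in association with its unit string and, optionally, other information such as a telephone number.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class LexiconEntry:
    spelling: str                        # spelled word entry 34
    unit_codes: List[int]                # concatenated hybrid unit string 36
    phone_number: Optional[str] = None   # optional associated information

class Lexicon:
    """Minimal in-memory stand-in for data structure 32."""

    def __init__(self) -> None:
        self._entries: Dict[str, LexiconEntry] = {}

    def register(self, spelling, unit_codes, phone_number=None):
        self._entries[spelling] = LexiconEntry(spelling, list(unit_codes), phone_number)

    def lookup(self, spelling):
        return self._entries[spelling]
```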

[0025] Storing the spelled words 34 gives the system
the ability to display the recognized word on the LCD 
display of the device 12. This provides a user friendly 
inexpensive feedback to assure the user that the system 
properly recognized his or her spoken entry. 

[0026] Referring next to Figure 2, the presently preferred
procedure for performing syllabification is illustrated in
steps 1-6. The reader may want to consult the examples
reproduced below when reviewing the flowchart of Figure 2. The
examples illustrate different word entries and show what the
syllabification algorithm does in each of the six numbered
steps. Line numbers in the examples correspond to step numbers
in Figure 2. In the examples, angled brackets <> are used to
denote syllable boundaries and the percent symbol % is used to
denote word boundaries. Numbers appearing after the phonemes
correspond to the degree of stress applied to that phoneme. The
presently preferred phoneticizer 14 generates phonetic output
at three stress levels, 0, 1 and 2.

Referring to Figure 2, syllable boundaries are placed around
each stress-bearing phoneme in step 1. Thus there will be a
syllable for each phoneme with a number following it to
indicate the stress level. Next, in step 2, all intervocalic
velar nasals ("ng") are placed into codas. Coda refers to that
portion of the syllable following the sonority peak of the
syllable (usually a vowel), called the nucleus. The velar nasal
"ng" can only occur in codas in English. Referring to line 2 in
the first example, note that the letters "ng" have been moved
inside the angled brackets at the coda position, that is, at
the position following the nucleus.

[0027] Next, in step 3, all intervocalic "s" and "sh" phonemes
are placed into the corresponding onset positions. Onset refers
to that portion of the syllable preceding the nucleus. See for
example line 3 in the second example presented below. In step 4
all unsyllabified "s" and "sh" phonemes that immediately follow
stressed vowels are placed into codas.
[0028] Step 5 then proceeds by optimizing the onsets with the
remaining intervocalic non-syllabified material. All of the
remaining intervocalic non-syllabified phonemes are tested to
see if they can form an onset. This is done by comparing them
with a list of possible onsets. If they can be made part of an
onset, they are so placed at this time. If they cannot form
part of an onset, then the procedure removes one phoneme at a
time from the beginning of the string until what remains can
form a possible onset. The onset is established at that point,
and the coda of the preceding syllable is extended up to it.
[0029] Finally, in step 6, the onset of the first syllable of
the word is expanded to the beginning of the word, and the coda
of the last syllable of the word is expanded to the end of the
word. Steps 5 and 6 will affect most words, whereas steps 1-4
affect only a limited subset. The following examples will now
further illustrate.

Examples: 

[0030]

Velar nasal put into coda in step 2.
-bellingham #NAME;

bcl b eh1 l ih0 ng ax0 m

1 %bcl b <eh1> l <ih0> ng <ax0> m%
2 %bcl b <eh1> l <ih0 ng> <ax0> m%
3 %bcl b <eh1> l <ih0 ng> <ax0> m%
4 %bcl b <eh1> l <ih0 ng> <ax0> m%
5 %bcl b <eh1> <l ih0 ng> <ax0> m%
6 %<bcl b eh1> <l ih0 ng> <ax0> m%

Intervocalic "s" put into onset in step 3.
-absences #

ae1 bcl b s en0 s ih0 z

1 %<ae1> bcl b s <en0> s <ih0> z%
2 %<ae1> bcl b s <en0> s <ih0> z%
3 %<ae1> bcl b s <en0> <s ih0> z%
4 %<ae1> bcl b s <en0> <s ih0> z%
5 %<ae1 bcl b> <s en0> <s ih0> z%

Intervocalic "sh" put into onset in step 3.
-abolitionist #

ae2 bcl b ax0 l ih1 sh ih0 n ih0 s tcl t

1 %<ae2> bcl b <ax0> l <ih1> sh <ih0> n <ih0> s tcl t%
2 %<ae2> bcl b <ax0> l <ih1> sh <ih0> n <ih0> s tcl t%
3 %<ae2> bcl b <ax0> l <ih1> <sh ih0> n <ih0> s tcl t%
4 %<ae2> bcl b <ax0> l <ih1> <sh ih0> n <ih0> s tcl t%
5 %<ae2> <bcl b ax0> <l ih1> <sh ih0> <n ih0> s tcl t%
6 %<ae2> <bcl b ax0> <l ih1> <sh ih0> <n ih0 s tcl t>%

Unsyllabified "s" put into coda after stressed vowel in step 4.
-abasement #

ax0 bcl b ey1 s m ih0 n tcl t

1 %<ax0> bcl b <ey1> s m <ih0> n tcl t%
2 %<ax0> bcl b <ey1> s m <ih0> n tcl t%
3 %<ax0> bcl b <ey1> s m <ih0> n tcl t%
4 %<ax0> bcl b <ey1 s> m <ih0> n tcl t%
5 %<ax0> <bcl b ey1 s> <m ih0> n tcl t%
6 %<ax0> <bcl b ey1 s> <m ih0 n tcl t>%

Unsyllabified "sh" put into coda after stressed vowel in step 4.
-Cochrane #/NAME;

kcl k ow1 sh r ey2 n

1 %kcl k <ow1> sh r <ey2> n%
2 %kcl k <ow1> sh r <ey2> n%
3 %kcl k <ow1> sh r <ey2> n%
4 %kcl k <ow1 sh> r <ey2> n%
5 %kcl k <ow1 sh> <r ey2> n%
6 %<kcl k ow1 sh> <r ey2 n>%
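
The six syllabification steps can be sketched in Python as follows. This is a hedged illustration, not the disclosed implementation. It makes several assumptions: `POSSIBLE_ONSETS` is a small invented list (the disclosure does not reproduce its list of possible onsets); "intervocalic" is read as "the only phoneme between two nuclei"; and a vowel counts as stressed when its stress level is 1 or 2.

```python
# Hypothetical onset list; just large enough for the examples above.
POSSIBLE_ONSETS = {
    ("bcl", "b"), ("kcl", "k"), ("tcl", "t"),
    ("l",), ("m",), ("n",), ("r",), ("s",), ("sh",),
}

def is_nucleus(phoneme):
    """A stress-bearing phoneme carries a trailing stress digit (0, 1 or 2)."""
    return phoneme[-1].isdigit()

def syllabify(phonemes, onsets=POSSIBLE_ONSETS):
    """Split a list of phonemes into syllables (lists of phonemes)."""
    phonemes = list(phonemes)
    nuclei = [i for i, p in enumerate(phonemes) if is_nucleus(p)]
    if not nuclei:
        return [phonemes]
    # Step 1: one syllable span [start, end) around each nucleus.
    spans = [[i, i + 1] for i in nuclei]
    for k in range(len(spans) - 1):
        left, right = spans[k], spans[k + 1]
        gap = phonemes[left[1]:right[0]]
        stressed = phonemes[nuclei[k]][-1] != "0"
        if gap == ["ng"]:                  # Step 2: velar nasal into coda
            left[1] += 1
        elif gap in (["s"], ["sh"]):       # Step 3: lone s/sh into onset
            right[0] -= 1
        elif gap and gap[0] in ("s", "sh") and stressed:
            left[1] += 1                   # Step 4: s/sh into coda
        # Step 5: the longest tail of the remaining gap that is a possible
        # onset joins the next syllable; the rest extends the coda.
        gap = phonemes[left[1]:right[0]]
        cut = next(j for j in range(len(gap) + 1)
                   if j == len(gap) or tuple(gap[j:]) in onsets)
        left[1] += cut
        right[0] = left[1]
    # Step 6: stretch the first onset and the last coda to the word edges.
    spans[0][0] = 0
    spans[-1][1] = len(phonemes)
    return [phonemes[a:b] for a, b in spans]
```

Applied to the phoneme strings of "abasement", "Cochrane" and "abolitionist" above, the sketch reproduces the groupings shown on line 6 of those examples.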

The Decision Tree Phoneticizer 

[0031] The presently preferred phoneticizer is a pro- 
nunciation generator which employs two stages. The 
first stage employs a set of letter-only decision trees 110 
and the second stage employs a set of mixed-decision 
trees 112. An input sequence 114, such as the sequence
of letters B-I-B-L-E, is fed to a dynamic programming
phoneme sequence generator 116. The sequence gen- 
erator uses the letter-only trees 110 to generate a list of 
pronunciations 118, representing possible pronuncia- 
tion candidates of the spelled word input sequence. 
[0032] The sequence generator sequentially exam- 
ines each letter in the sequence, applying the decision 
tree associated with that letter to select a phoneme pro- 
nunciation for that letter based on probability data con- 
tained in the letter-only tree. 

[0033] Preferably the set of letter-only decision trees 
includes a decision tree for each letter in the alphabet. 
Figure 4 shows an example of a letter-only decision tree 
for the letter E. The decision tree comprises a plurality 
of internal nodes (illustrated as ovals in the Figure) and 
a plurality of leaf nodes (illustrated as rectangles in the
Figure). Each internal node is populated with a yes-no 
question. Yes-no questions are questions that can be 
answered either yes or no. In the letter-only tree these 
questions are directed to the given letter (in this case 
the letter E) and its neighboring letters in the input sequence.
Note in Figure 4 that each internal node branches
either left or right depending on whether the answer
to the associated question is yes or no. 
[0034] Abbreviations are used in Figure 4 as follows: numbers
in questions, such as "+1" or "-1", refer to positions in the
spelling relative to the current letter. For example,
"+1L=='R'?" means "Is the letter after the current letter
(which in this case is the letter E) an R?" The abbreviations
CONS and VOW represent classes of letters, namely consonants
and vowels. The absence of a neighboring letter, or null
letter, is represented by the symbol -, which is used as a
filler or placeholder when aligning certain letters with
corresponding phoneme pronunciations. The symbol # denotes a
word boundary.
[0035] The leaf nodes are populated with probability 
data that associate possible phoneme pronunciations 
with numeric values representing the probability that the 
particular phoneme represents the correct pronuncia- 
tion of the given letter. For example, the notation 
"iy=>0.51" means "the probability of phoneme 'iy' in this 
leaf is 0.51." The null phoneme, i.e., silence, is represented
by the symbol -.
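
A letter-only decision tree of this kind can be represented with two small node types and a traversal loop. The sketch below is illustrative only: the class names, the helper `letter_at`, and the tree `E_TREE` with its questions and probabilities are invented for the example and do not reproduce the tree of Figure 4.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union

@dataclass
class Leaf:
    probs: Dict[str, float]             # phoneme -> probability

@dataclass
class Node:
    question: Callable[[str, int], bool]  # yes-no question on (word, position)
    yes: Union["Node", Leaf]
    no: Union["Node", Leaf]

def letter_at(word, index):
    """Return the letter at index, or '#' beyond a word boundary."""
    return word[index] if 0 <= index < len(word) else "#"

def phoneme_probs(tree, word, position):
    """Walk the tree for the letter at `position` and return its leaf data."""
    node = tree
    while isinstance(node, Node):
        node = node.yes if node.question(word, position) else node.no
    return node.probs

# Hypothetical tree for the letter E (questions and numbers are made up).
E_TREE = Node(
    question=lambda w, i: letter_at(w, i + 1) == "R",        # "+1L=='R'?"
    yes=Leaf({"eh": 0.6, "ih": 0.3, "-": 0.1}),
    no=Node(
        question=lambda w, i: letter_at(w, i + 1) == "#",    # "+1L=='#'?"
        yes=Leaf({"-": 0.8, "iy": 0.2}),
        no=Leaf({"iy": 0.5, "eh": 0.3, "-": 0.2}),
    ),
)
```

For the final E of "BIBLE", `phoneme_probs(E_TREE, "BIBLE", 4)` reaches the word-boundary leaf, in which the null phoneme is the most probable outcome.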

[0036] The sequence generator 116 (Fig. 3) thus uses 
the letter-only decision trees 110 to construct one or 
more pronunciation hypotheses that are stored in list 
118. Preferably each pronunciation has associated with 
it a numerical score arrived at by combining the proba- 
bility scores of the individual phonemes selected using 
the decision tree 110. Word pronunciations may be 
scored by constructing a matrix of possible combina- 
tions and then using dynamic programming to select the 
n-best candidates. Alternatively, the n-best candidates 
may be selected using a substitution technique that first 
identifies the most probable word candidate and then 
generates additional candidates through iterative sub- 
stitution, as follows. 

[0037] The pronunciation with the highest probability 
score is selected first, by multiplying the respective 
scores of the highest-scoring phonemes (identified by 
examining the leaf nodes) and then using this selection 
as the most probable candidate or first-best word can- 
didate. Additional (n-best) candidates are then selected 
by examining the phoneme data in the leaf nodes again 
to identify the phoneme, not previously selected, that 
has the smallest difference from an initially selected 
phoneme. This minimally-different phoneme is then 
substituted for the initially selected one to thereby gen- 
erate the second-best word candidate. The above proc- 
ess may be repeated iteratively until the desired number 
of n-best candidates have been selected. List 118 may 
be sorted in descending score order, so that the pronun- 
ciation judged the best by the letter-only analysis ap- 



pears first in the list. 
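
The substitution technique can be sketched as follows for the first two candidates. The sketch reads "smallest difference" as the smallest gap in probability between the selected phoneme and its runner-up at the same letter; that reading, and the function name `two_best`, are assumptions made for illustration.

```python
from math import prod

def two_best(leaf_probs):
    """Return ((best, score), (second, score)) from per-letter leaf data.

    leaf_probs holds one dict per letter, mapping phoneme -> probability.
    The second element is None when no letter offers an alternative.
    """
    ranked = [sorted(p.items(), key=lambda kv: kv[1], reverse=True)
              for p in leaf_probs]
    best = [r[0][0] for r in ranked]
    best_score = prod(r[0][1] for r in ranked)
    # Probability gap between the top phoneme and its runner-up, per letter.
    gaps = [(r[0][1] - r[1][1], i) for i, r in enumerate(ranked) if len(r) > 1]
    if not gaps:
        return (best, best_score), None
    _, i = min(gaps)                      # minimally different substitution
    second = list(best)
    second[i] = ranked[i][1][0]
    second_score = best_score / ranked[i][0][1] * ranked[i][1][1]
    return (best, best_score), (second, second_score)
```

Repeating the substitution while excluding phonemes already used would extend the sketch to further candidates.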

[0038] As noted above, a letter-only analysis will frequently
produce poor results. This is because the letter-only analysis
has no way of determining at each letter what phoneme will be
generated by subsequent letters. Thus a letter-only analysis
can generate a high scoring pronunciation that actually would
not occur in natural speech. For example, the proper name,
Achilles, would likely result in a pronunciation that
phoneticizes both l's: ah-k-ih-l-l-iy-z. In natural speech, the
second l is actually silent: ah-k-ih-l-iy-z. The sequence
generator using letter-only trees has no mechanism to screen
out word pronunciations that would never occur in natural
speech.

[0039] The second stage of the pronunciation system
addresses the above problem. A mixed-tree score esti- 
mator 120 uses the set of mixed-decision trees 112 to 
assess the viability of each pronunciation in list 118. The 
score estimator works by sequentially examining each 
letter in the input sequence along with the phonemes 
assigned to each letter by sequence generator 116. 
[0040] Like the set of letter-only trees, the set of mixed 
trees has a mixed tree for each letter of the alphabet. 
An exemplary mixed tree is shown in Figure 5. Like the 
letter-only tree, the mixed tree has internal nodes and 
leaf nodes. The internal nodes are illustrated as ovals 
and the leaf nodes as rectangles in Figure 5. The inter- 
nal nodes are each populated with a yes-no question 
and the leaf nodes are each populated with probability 
data. Although the tree structure of the mixed tree re- 
sembles that of the letter-only tree, there is one impor- 
tant difference. The internal nodes of the mixed tree can 
contain two different classes of questions. An internal 
node can contain a question about a given letter and its 
neighboring letters in the sequence, or it can contain a 
question about the phoneme associated with that letter 
and neighboring phonemes corresponding to that se- 
quence. The decision tree is thus mixed, in that it con- 
tains mixed classes of questions. 
[0041] The abbreviations used in Figure 5 are similar 
to those used in Figure 4, with some additional abbre- 
viations. The symbol L represents a question about a 
letter and its neighboring letters. The symbol P repre- 
sents a question about a phoneme and its neighboring 
phonemes. For example, the question "+1L='D'?" means "Is the
letter in the +1 position a 'D'?" The abbreviations CONS and
SYL are phoneme classes, namely consonant and syllabic. For
example, the question "+1P=CONS?" means "Is the phoneme in the
+1 position a consonant?" The numbers in the leaf nodes give
phoneme probabilities as they did in the letter-only 
trees. 

[0042] The mixed-tree score estimator rescores each 
of the pronunciations in list 118 based on the mixed-tree 
questions and using the probability data in the leaf
nodes of the mixed trees. If desired, the list of pronun- 
ciations may be stored in association with the respective 
score as in list 122. If desired, list 122 can be sorted in 
descending order so that the first listed pronunciation is
the one with the highest score. 
[0043] In many instances the pronunciation occupy- 
ing the highest score position in list 122 will be different 
from the pronunciation occupying the highest score position
in list 118. This occurs because the mixed-tree
score estimator, using the mixed trees 112, screens out 
those pronunciations that do not contain self-consistent 
phoneme sequences or otherwise represent pronunci- 
ations that would not occur in natural speech. 
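
The effect of a phoneme question can be illustrated with the Achilles example. In the sketch below each mixed tree is collapsed into a plain function, and only the letter L has one; the function `l_tree`, its probabilities, and the letter-to-phoneme alignment are assumptions made for illustration rather than trees taken from the disclosure.

```python
def l_tree(letters, phonemes, i):
    """Toy mixed tree for the letter L, asking a phoneme question.

    Question: is the phoneme in the -1 position an 'l'? If so, this L
    is most likely silent.
    """
    if i > 0 and phonemes[i - 1] == "l":
        return {"-": 0.9, "l": 0.1}
    return {"l": 0.95, "-": 0.05}

MIXED_TREES = {"L": l_tree}  # hypothetical; a full system has one per letter

def rescore(letters, phonemes, trees):
    """Multiply the mixed-tree probabilities of the aligned phonemes."""
    score = 1.0
    for i, letter in enumerate(letters):
        tree = trees.get(letter)
        if tree is None:
            continue  # letters without a toy tree leave the score unchanged
        score *= tree(letters, phonemes, i).get(phonemes[i], 0.0)
    return score

letters = list("ACHILLES")
doubled = ["ah", "k", "-", "ih", "l", "l", "iy", "z"]  # both l's pronounced
silent = ["ah", "k", "-", "ih", "l", "-", "iy", "z"]   # second l silent
```

Under these assumed numbers, `rescore(letters, silent, MIXED_TREES)` exceeds `rescore(letters, doubled, MIXED_TREES)`, so the pronunciation with the silent second l moves to the top of the rescored list.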
[0044] If desired a selector module 124 can access 
list 122 to retrieve one or more of the pronunciations in 
the list. Typically selector 124 retrieves the pronuncia- 
tion with the highest score and provides this as the out- 
put pronunciation 126. 

A Hybrid Unit Word Recognizer 

[0045] The similarity-based hybrid unit representation 
lends itself well to compact speech recognizers, suitable 
for a variety of consumer applications. Input speech 
supplied to the recognizer is compared with entries in 
the lexicon using a pattern matching algorithm. A dy- 
namic time warping (DTW) algorithm may be used for 
example. 
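
A textbook dynamic time warping comparison against stored templates can be sketched as follows. The sketch uses scalar frames and an absolute-difference frame distance for brevity; an actual recognizer would compare vectors of phone similarity values. The function names and the sample templates are hypothetical.

```python
def dtw_distance(a, b, dist=lambda x, y: abs(x - y)):
    """Classic DTW cost between two frame sequences a and b."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j]: best accumulated cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = dist(a[i - 1], b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # stretch b
                                    cost[i][j - 1],      # stretch a
                                    cost[i - 1][j - 1])  # advance both
    return cost[n][m]

def recognize(utterance, templates):
    """Return the lexicon word whose template is closest to the utterance."""
    return min(templates, key=lambda word: dtw_distance(utterance, templates[word]))
```

Because the alignment may stretch either sequence, an utterance spoken more slowly than its template (for example `[1, 1, 2, 3, 3]` against `[1, 2, 3]`) still matches at zero cost.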

[0046] To accommodate possible variation in stress 
or speed at which syllables within a spelled word may 
be spoken, the system employs a set of rules to com- 
press or expand the duration of certain hybrid units. The 
syllables within long spelled words are sometimes pro- 
nounced rapidly. This information may be added to the 
lexicon, for example. The recognizer can then use a pri- 
ori knowledge about the length of spelled words — ob- 
tained by counting the number of letters in the spelled 
word, for example — to better match spoken input to the 
proper lexicon entry. 
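
One possible form of such a duration rule is sketched below for illustration. The threshold of eight letters, the compression factor of 0.8, the function names, and the nearest-frame resampling are all hypothetical; the text states only that the duration of certain hybrid units may be compressed or expanded based on a priori knowledge such as the number of letters.

```python
def duration_scale(spelled_word, long_word_letters=8, compression=0.8):
    """Hypothetical rule: compress unit templates for long spelled words."""
    return compression if len(spelled_word) >= long_word_letters else 1.0


def resample(frames, scale):
    """Shorten or lengthen a unit template by picking frames at a new rate."""
    if not frames:
        return []
    target = max(1, round(len(frames) * scale))
    step = len(frames) / target
    return [frames[min(len(frames) - 1, int(k * step))] for k in range(target)]


unit = list(range(10))              # ten frames of one hybrid unit (toy data)
for word in ("Bob", "Washington"):
    scaled = resample(unit, duration_scale(word))
    print(word, len(scaled))
```

A scale greater than 1.0 expands a unit in the same way by repeating frames.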

[0047] Other techniques for incorporating a priori 
knowledge of variation in pronunciation include applying 
weights to the more reliable hybrid unit information in 
the lexicon. The boundaries of hybrid units may be less 
reliable than the center frames. The pattern matching 
algorithm may therefore weight the center frames more 
heavily than the boundaries, thus emphasizing the most 
reliable parts of the hybrid units. 
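
The center-weighted comparison may be sketched as follows, for illustration only. The triangular weight profile, the edge weight of 0.5, the function names, and the assumption that template and observed frames have already been aligned to equal length are choices made for this example and are not taken from the text.

```python
def frame_weights(n, edge_weight=0.5):
    """Triangular weights: 1.0 at the center frame, edge_weight at the boundaries."""
    if n == 1:
        return [1.0]
    mid = (n - 1) / 2
    return [edge_weight + (1 - edge_weight) * (1 - abs(i - mid) / mid)
            for i in range(n)]


def weighted_distance(template, observed, dist):
    """Weighted mean frame distance between aligned sequences of equal length."""
    if not template or len(template) != len(observed):
        raise ValueError("sequences must be non-empty and of equal length")
    weights = frame_weights(len(template))
    total = sum(w * dist(t, o) for w, t, o in zip(weights, template, observed))
    return total / sum(weights)


template = [0, 0, 0, 0, 0]
d = lambda x, y: abs(x - y)
# The same error counts for less at a unit boundary than at the unit center.
print(weighted_distance(template, [1, 0, 0, 0, 0], d))
print(weighted_distance(template, [0, 0, 1, 0, 0], d))
```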



Claims 

1. A speech recognizer having a lexicon updateable 
by spelled word input, comprising: 

a phoneticizer for generating a phonetic tran- 
scription of said spelled word input; 
a hybrid unit generator receptive of said pho- 
netic transcription for generating at least one 
hybrid unit representation of said spelled word 
input based on said phonetic transcription; and 
a word template constructor that generates for
said spelled word a sequence of symbols indicative
of said hybrid unit representation for storing
in said lexicon.

2. The speech recognizer of claim 1 wherein said phoneticizer
includes a set of decision trees that identify
different phoneme transcriptions corresponding to
letters of an alphabet.

3. The speech recognizer of claim 1 further comprising
a multiple phonetic transcription generator that converts
said phonetic transcription into an n-best plurality
of phonetic transcriptions.

4. The speech recognizer of claim 3 wherein said multiple
phonetic transcription generator includes a
confusion matrix that stores different phoneme transcriptions
for confusable letters of an alphabet.

5. The speech recognizer of claim 1 wherein said phoneticizer
generates one phonetic transcription and
said speech recognizer further comprises a multiple
phonetic transcription generator that converts said
one phonetic transcription into an n-best plurality of
phonetic transcriptions.

6. The speech recognizer of claim 1 wherein said pho- 
neticizer generates an n-best plurality of phonetic 
transcriptions. 

7. The speech recognizer of claim 1 wherein said hy- 
brid unit generator generates a plurality of hybrid 
unit representations of said spelled word. 

8. The speech recognizer of claim 7 further comprising
scoring processor for applying a score to each of
said plurality of hybrid unit representations and for
selecting at least one of said plurality of hybrid unit
representations to be provided to said word template
constructor based on said score.

9. The speech recognizer of claim 8 wherein said scoring
processor includes a set of decision trees that
apply different scores to different phoneme transcriptions.

10. The speech recognizer of claim 1 further comprising 
speech data input for providing pronunciation infor- 
mation about said spelled word. 

11. The speech recognizer of claim 10 wherein said 
speech data input comprises voice input for supply- 
ing pronunciation information based on speech 
supplied by a user. 

12. The speech recognizer of claim 10 wherein said hybrid
unit generator generates a plurality of hybrid
unit representations of said spelled word; and
further comprising scoring processor for selecting
one of said plurality of hybrid unit representations
to be provided to said word template constructor
based on said speech data.

13. The speech recognizer of claim 1 wherein said word 
template constructor includes a dictionary contain- 
ing similarity-based representation of said hybrid 
units. 

14. The speech recognizer of claim 1 wherein said pho- 
neticizer includes a memory for storing spelling-to- 
pronunciation data comprising: 



a decision tree data structure stored in said
memory that defines a plurality of internal
nodes and a plurality of leaf nodes, said internal
nodes adapted for storing yes-no questions
and said leaf nodes adapted for storing probability
data;
a first plurality of said internal nodes being populated
with letter questions about a given letter
and its neighboring letters in said spelled word
input;
a second plurality of said internal nodes being
populated with phoneme questions about a
phoneme and its neighboring phonemes in said
spelled word input;
said leaf nodes being populated with probability
data that associates said given letter with a plurality
of phoneme pronunciations.



15. The speech recognizer of claim 1 wherein said hy- 
brid units are represented as similarity parameters. 

16. The speech recognizer of claim 1 wherein said hy- 
brid units are represented as phone similarity pa- 
rameters based on an average similarity derived 
from a plurality of training examples. 

17. The speech recognizer of claim 1 further comprising 
hybrid unit duration modification rules for expanding 
or compressing duration of selected hybrid units 
based on length of said spelled word. 

18. The speech recognizer of claim 1 further comprising
pattern matching mechanism for comparing a
voiced input to said lexicon, said pattern matching
mechanism having weighting mechanism for increasing
the importance of selected portions of said
hybrid units during pattern matching.